Abstract Constrained clustering has been well-studied for algorithms such as K-means
and hierarchical clustering. However, how to satisfy many constraints in
these algorithmic settings has been shown to be intractable. One alternative to 
encode many constraints is to use spectral clustering, which remains a developing 
area. In this paper, we propose a flexible framework for constrained spectral clus- 
tering. In contrast to some previous efforts that implicitly encode Must-Link and 
Cannot-Link constraints by modifying the graph Laplacian or constraining the un- 
derlying eigenspace, we present a more natural and principled formulation, which 
explicitly encodes the constraints as part of a constrained optimization problem. 
Our method offers several practical advantages: it can encode the degree of be- 
lief in Must-Link and Cannot-Link constraints; it guarantees to lower-bound how 
well the given constraints are satisfied using a user-specified threshold; it can be 
solved deterministically in polynomial time through generalized eigendecomposi- 
tion. Furthermore, by inheriting the objective function from spectral clustering 
and encoding the constraints explicitly, much of the existing analysis of uncon- 
strained spectral clustering techniques remains valid for our formulation. We val- 
idate the effectiveness of our approach by empirical results on both artificial and 
real datasets. We also demonstrate an innovative use of encoding a large number of
constraints: transfer learning via constraints.

[Figure 1 panels: (c) Spectral clustering; (d) Constrained spectral clustering]

Fig. 1 A motivating example for constrained spectral clustering.



1 Introduction 



1.1 Background and Motivation 



Spectral clustering is an important clustering technique that has been extensively
studied in the image processing, data mining, and machine learning communities
(Shi and Malik (2000); von Luxburg (2007); Ng et al (2001)). It is considered
superior to traditional clustering algorithms like K-means in terms of having a
deterministic polynomial-time solution, the ability to model arbitrarily shaped
clusters, and its equivalence to certain graph cut problems. For example, spectral
clustering is able to capture the underlying moon-shaped clusters as shown in
Fig. 1(b), whereas K-means would fail (Fig. 1(a)). The advantage of spectral
clustering has also been validated by many real-world applications, such as image
segmentation (Shi and Malik (2000)) and mining social networks (White and Smyth (2005)).

Spectral clustering was originally proposed to address an unsupervised learning 
problem: the data instances are unlabeled, and all available information is encoded 
in the graph Laplacian. However, there are cases where unsupervised spectral 
clustering becomes insufficient. Using the same toy data, as shown in Fig. 1(c),
when the two moons are under-sampled, the clusters become so sparse that the 
separation of them becomes difficult. To help spectral clustering recover from an 
undesirable partition, we can introduce side information in various forms, in either 
small or large amounts. For example:

1. Pairwise constraints: Domain experts may explicitly assign constraints that
state a pair of instances must be in the same cluster (Must-Link, ML for short)
or that a pair of instances cannot be in the same cluster (Cannot-Link, CL for
short). For instance, as shown in Fig. 1(d), we assigned several ML (solid
lines) and CL (dashed lines) constraints, then applied our constrained spectral
clustering algorithm, which we will describe later. As a result, the two moons
were successfully recovered.

2. Partial labeling: There can be labels on some of the instances, which are
neither complete nor exhaustive. We demonstrate in Fig. 9 that even small
amounts of labeled information can greatly improve clustering results when
compared against the ground truth partition, as inferred by the labels.

3. Alternative weak distance metrics: In some situations there may be more
than one distance metric available. For example, in Table 3 and accompanying
paragraphs we describe clustering documents using distance functions based
on different languages (features).

4. Transfer of knowledge: In the context of transfer learning (Pan and Yang
(2010)), if we treat the graph Laplacian as the target domain, we could transfer
knowledge from a different but related graph, which can be viewed as the source
domain. We discuss this direction in Sections 6.3 and 7.5.

All the aforementioned side information can be transformed into pairwise ML and 
CL constraints, which could either be hard (binary) or soft (degree of belief). For 
example, if the side information comes from a source graph, we can construct 
pairwise constraints by assuming that the more similar two instances are in the
source graph, the more likely they belong to the same cluster in the target graph. 
Consequently the constraints should naturally be represented by a degree of belief, 
rather than a binary assertion. 

How to make use of this side information to improve clustering falls into
the area of constrained clustering (Basu et al (2008)). In general, constrained
clustering is a category of techniques that try to incorporate ML and CL
constraints into existing clustering schemes. It has been well studied on algorithms
such as K-means clustering, mixture model, hierarchical clustering, and
density-based clustering. Previous studies showed that satisfying all constraints at once
(Davidson and Ravi (2007a)), incrementally (Davidson et al (2007)), or even pruning
constraints (Davidson and Ravi (2007b)) is intractable. Furthermore, it was
shown that algorithms that build set partitions incrementally (such as K-means
and EM) are prone to being over-constrained (Davidson and Ravi (2006)). In
contrast, incorporating constraints into spectral clustering is a promising direction
since, unlike existing algorithms, all data instances are assigned simultaneously to
clusters, even if the given constraints are inconsistent.

Constrained spectral clustering is still a developing area. Previous work on this
topic can be divided into two categories, based on how they enforce the constraints.
The first category (Kamvar et al (2003); Xu et al (2005); Lu and Carreira-Perpinan
(2008); Wang et al (2009); Ji and Xu (2006)) directly manipulates the graph Laplacian
(or equivalently, the affinity matrix) according to the given constraints; then
unconstrained spectral clustering is applied on the modified graph Laplacian. The
second category uses constraints to restrict the feasible solution space (De Bie et al
(2004); Coleman et al (2008); Li et al (2009); Yu and Shi (2001, 2004)). Existing
methods in both categories share several limitations:

— They are designed to handle only binary constraints. However, as we have
stated above, in many real-world applications, constraints are made available 
in the form of real-valued degree of belief, rather than a yes or no assertion. 

— They aim to satisfy as many constraints as possible, which could lead to inflex- 
ibility in practice. For example, the given set of constraints could be noisy, and 
satisfying some of the constraints could actually hurt the overall performance. 
Also, it is reasonable to ignore a small portion of constraints in exchange for a 
clustering with much lower cost. 

— They do not offer any natural interpretation of either the way that constraints 
are encoded or the implication of enforcing them. 



1.2 Our Contributions 

In this paper, we study how to incorporate large amounts of pairwise constraints 
into spectral clustering, in a flexible manner that addresses the limitations of 
previous work. Then we show the practical benefits of our approach, including 
new applications previously not possible. 

We extend beyond binary ML/CL constraints and propose a more flexible 
framework to accommodate general-type side information. We allow the binary 
constraints to be relaxed to real-valued degree of belief that two data instances 
belong to the same cluster or two different clusters. Moreover, instead of trying 
to satisfy each and every constraint that has been given, we use a user-specified 
threshold to lower bound how well the given constraints must be satisfied. There- 
fore, our method provides maximum flexibility in terms of both representing 
constraints and satisfying them. This, in addition to handling large amounts of 
constraints, allows the encoding of new styles of information such as entire graphs 
and alternative distance metrics in their raw form without considering issues such 
as constraint inconsistencies and over-constraining. 

Our contributions are: 

— We propose a principled framework for constrained spectral clustering that can 
incorporate large amounts of both hard and soft constraints. 

— We show how to enforce constraints in a flexible way: a user-specified threshold 
is introduced so that a limited amount of constraints can be ignored in exchange 
for lower clustering cost. This allows incorporating side information in its raw 
form without considering issues such as inconsistency and over-constraining. 

— We extend the objective function of unconstrained spectral clustering by encod- 
ing constraints explicitly and creating a novel constrained optimization prob- 
lem. Thus our formulation naturally covers unconstrained spectral clustering 
as a special case. 

— We show that our objective function can be turned into a generalized eigenvalue
problem, which can be solved deterministically in polynomial time. This is a
major advantage over constrained K-means clustering, which produces
non-deterministic solutions while being intractable even for K = 2 (Drineas et al
(2004); Davidson and Ravi (2007b)).

— We interpret our formulation from both the graph cut perspective and the 
Laplacian embedding perspective.

— We validate the effectiveness of our approach and its advantage over existing
methods using standard benchmarks and new innovative applications such as 
transfer learning. 



This paper is an extension of our previous work (Wang and Davidson (2010))
with the following additions: 1) we extend our algorithm from 2-way partition
to K-way partition (Section 6.2); 2) we add a new geometric interpretation to
our algorithm (Section 5.2); 3) we show how to apply our algorithm to a novel
application (Section 6.3), namely transfer learning, and test it with a real-world
fMRI dataset (Section 7.5); 4) we present a much more comprehensive experiment
section with more tasks conducted on more datasets (Sections 7.2 and 7.4).

The rest of the paper is organized as follows: In Section 2 we briefly survey
previous work on constrained spectral clustering; Section 3 provides preliminaries
for spectral clustering; in Section 4 we formally introduce our formulation for
constrained spectral clustering and show how to solve it efficiently; in Section 5 we
interpret our objective from two different perspectives; in Section 6 we discuss the
implementation of our algorithm and possible extensions; we empirically evaluate
our approach in Section 7; Section 8 concludes the paper.



2 Related Work 

Constrained clustering is a category of methods that extend clustering from the
unsupervised setting to the semi-supervised setting, where side information is available in
the form of, or can be converted into, pairwise constraints. A number of algorithms
have been proposed on how to incorporate constraints into spectral clustering,
which can be grouped into two categories.

The first category manipulates the graph Laplacian directly. Kamvar et al
(2003) proposed the spectral learning algorithm that sets the $(i, j)$-th entry of
the affinity matrix to 1 if there is a ML between node $i$ and $j$; 0 for CL. A
new graph Laplacian is then computed based on the modified affinity matrix. In
(Xu et al (2005)), the constraints are encoded in the same way, but a random walk
matrix is used instead of the normalized Laplacian. Kulis et al (2005) proposed
to add both positive (for ML) and negative (for CL) penalties to the affinity
matrix (they then used kernel K-means, instead of spectral clustering, to find the
partition based on the new kernel). Lu and Carreira-Perpinan (2008) proposed to
propagate the constraints in the affinity matrix. In Ji and Xu (2006); Wang et al
(2009), the graph Laplacian is modified by combining the constraint matrix as a
regularizer. The limitation of these approaches is that there is no principled way
to decide the weights of the constraints, and there is no guarantee on how well
the given constraints will be satisfied.

The second category manipulates the eigenspace directly. For example, the
subspace trick introduced by De Bie et al (2004) alters the eigenspace which the
cluster indicator vector is projected onto, based on the given constraints. This
technique was later extended in Coleman et al (2008) to accommodate inconsistent
constraints. Yu and Shi (2001, 2004) encoded partial grouping information as
a subspace projection. Li et al (2009) enforced constraints by regularizing the
spectral embedding. This type of approach usually strictly enforces the given constraints.
As a result, the results are often over-constrained, which makes the algorithms
sensitive to noise and inconsistencies in the constraint set. Moreover, it is non-trivial
to extend these approaches to incorporate soft constraints.

Table 1 Table of notations

Symbol                  Meaning
$\mathcal{G}$           An undirected (weighted) graph
$A$                     The affinity matrix
$D$                     The degree matrix
$I$                     The identity matrix
$L$ / $\bar{L}$         The unnormalized/normalized graph Laplacian
$Q$ / $\bar{Q}$         The unnormalized/normalized constraint matrix
$vol$                   The volume of graph $\mathcal{G}$

In addition, Gu et al (2011) proposed a spectral kernel design that combines
multiple clustering tasks. The learned kernel is constrained in such a way that
the data distributions of any two tasks are as close as possible. Their problem
setting differs from ours because we aim to perform single-task clustering by
using two (disagreeing) data sources. Wang et al (2008) showed how to incorporate
pairwise constraints into a penalized matrix factorization framework. Their matrix
approximation objective function, which is different from our normalized min-cut
objective, is solved by an EM-like algorithm.

We would like to stress that the pros and cons of spectral clustering as
compared to other clustering schemes, such as K-means clustering, hierarchical
clustering, etc., have been thoroughly studied and well established. We do not claim
that constrained spectral clustering is universally superior to other constrained 
clustering schemes. The goal of this work is to provide a way to incorporate con- 
straints into spectral clustering that is more flexible and principled as compared 
with existing constrained spectral clustering techniques. 



3 Background and Preliminaries 

In this paper we follow the standard graph model that is commonly used in the 
spectral clustering literature. We reiterate some of the definitions and properties 
in this section, such as graph Laplacian, normalized min-cut, eigendecomposition 
and so forth, to make this paper self-contained. Readers who are familiar with 
the materials can skip to our formulation in Section 4. Important notations used
throughout the rest of the paper are listed in Table 1.

A collection of $N$ data instances is modeled by an undirected, weighted graph
$\mathcal{G}(\mathcal{V}, \mathcal{E}, A)$, where each data instance corresponds to a vertex (node) in $\mathcal{V}$; $\mathcal{E}$ is the
edge set and $A$ is the associated affinity matrix. $A$ is symmetric and non-negative.
The diagonal matrix $D = \mathrm{diag}(D_{11}, \ldots, D_{NN})$ is called the degree matrix of graph
$\mathcal{G}$, where

$$D_{ii} = \sum_{j=1}^{N} A_{ij}.$$

Then

$$L = D - A$$

is called the unnormalized graph Laplacian of $\mathcal{G}$. Assuming $\mathcal{G}$ is connected (i.e.
any node is reachable from any other node), $L$ has the following properties:

Property 1 (Properties of graph Laplacian (von Luxburg (2007))) Let $L$ be
the graph Laplacian of a connected graph, then:

1. $L$ is symmetric and positive semi-definite.

2. $L$ has one and only one eigenvalue equal to 0, and $N - 1$ positive eigenvalues:
$0 = \lambda_0 < \lambda_1 \le \ldots \le \lambda_{N-1}$.

3. $\mathbf{1}$ is an eigenvector of $L$ with eigenvalue 0 ($\mathbf{1}$ is a constant vector whose entries are
all 1).
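
As a quick numerical illustration of Property 1, the following sketch builds $D$ and $L = D - A$ for a small connected graph and checks the three properties with NumPy. The graph (two triangles joined by one edge) and the helper name `unnormalized_laplacian` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical connected graph: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

def unnormalized_laplacian(A):
    """Return (D, L) with D = diag(row sums of A) and L = D - A."""
    D = np.diag(A.sum(axis=1))
    return D, D - A

D, L = unnormalized_laplacian(A)
eigvals = np.linalg.eigvalsh(L)   # real, ascending, because L is symmetric
ones = np.ones(A.shape[0])

# Property 1.1: symmetric and positive semi-definite (up to round-off).
is_symmetric = np.allclose(L, L.T)
is_psd = eigvals.min() > -1e-10
# Property 1.2: exactly one zero eigenvalue for a connected graph.
num_zero_eigs = int(np.sum(np.abs(eigvals) < 1e-10))
# Property 1.3: the constant vector is an eigenvector with eigenvalue 0.
ones_in_nullspace = np.allclose(L @ ones, 0.0)
```

The tolerance `1e-10` is an arbitrary choice for this small example; larger or badly scaled graphs would likely need a looser threshold.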



Shi and Malik (2000) showed that the eigenvectors of the graph Laplacian can
be related to the normalized min-cut (Ncut) of $\mathcal{G}$. The objective function can be
written as:

$$\arg\min_{v} v^T \bar{L} v, \quad \text{s.t. } v^T v = vol,\ v \perp D^{1/2}\mathbf{1}. \tag{1}$$

Here

$$\bar{L} = D^{-1/2} L D^{-1/2}$$

is called the normalized graph Laplacian (von Luxburg (2007)); $vol = \sum_{i=1}^{N} D_{ii}$ is
the volume of $\mathcal{G}$; the first constraint $v^T v = vol$ normalizes $v$; the second constraint
$v \perp D^{1/2}\mathbf{1}$ rules out the principal eigenvector of $\bar{L}$ as a trivial solution, because
it does not define a meaningful cut on the graph. The relaxed cluster indicator $u$
can be recovered from $v$ as:

$$u = D^{-1/2} v.$$

Note that the result of spectral clustering is solely decided by the affinity 
structure of graph Q as encoded in the matrix A (and thus the graph Laplacian 
L). We will then describe our extensions on how to incorporate side information 
so that the result of clustering will reflect both the affinity structure of the graph
and the structure of the side information. 
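
The relaxed Ncut of Eq.(1) can be sketched in a few lines. The code below is a minimal illustration, assuming a connected graph with positive degrees; it takes the eigenvector of the normalized Laplacian with the second smallest eigenvalue, rescales it so that $v^T v = vol$, and maps it back to $u = D^{-1/2} v$. The function name `relaxed_ncut` and the six-node graph are hypothetical.

```python
import numpy as np

def relaxed_ncut(A):
    """Relaxed normalized min-cut of Eq.(1); returns (u, v)."""
    d = A.sum(axis=1)                      # degrees D_ii
    vol = d.sum()                          # volume of the graph
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Normalized Laplacian: D^{-1/2} (D - A) D^{-1/2} = I - D^{-1/2} A D^{-1/2}.
    L_bar = np.eye(len(d)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_bar)     # eigenvalues in ascending order
    # Column 0 is parallel to D^{1/2} 1 (the trivial solution); take column 1.
    v = eigvecs[:, 1]
    v = v * np.sqrt(vol) / np.linalg.norm(v)
    u = d_inv_sqrt * v                     # relaxed cluster indicator
    return u, v

# Hypothetical graph: two triangles joined by a single edge.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

u, v = relaxed_ncut(A)
labels = np.sign(u)   # thresholding at 0 is one simple way to read off a 2-way cut
```

Thresholding the relaxed indicator at zero is only one possible rounding step; it is used here because it keeps the example short.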



4 A Flexible Framework for Constrained Spectral Clustering 

In this section, we show how to incorporate side information into spectral clustering 
as pairwise constraints. Our formulation allows both hard and soft constraints. 
We propose a new constrained optimization formulation for constrained spectral 
clustering. Then we show how to solve the objective function by converting it into 
a generalized eigenvalue system. 



4.1 The Objective Function 

We encode side information with an $N \times N$ constraint matrix $Q$. Traditionally,
constrained clustering only accommodates binary constraints, namely Must-Link
and Cannot-Link:

$$Q_{ij} = Q_{ji} = \begin{cases} +1 & \text{if } ML(i,j) \\ -1 & \text{if } CL(i,j) \\ 0 & \text{no side information available} \end{cases}$$

Let $u \in \{-1,+1\}^N$ be a cluster indicator vector, where $u_i = +1$ if node $i$ belongs
to cluster + and $u_i = -1$ if node $i$ belongs to cluster −, then

$$u^T Q u = \sum_{i=1}^{N} \sum_{j=1}^{N} u_i u_j Q_{ij}$$

is a measure of how well the constraints in $Q$ are satisfied by the assignment $u$:
the measure will increase by 1 if $Q_{ij} = 1$ and node $i$ and $j$ have the same sign in
$u$; it will decrease by 1 if $Q_{ij} = 1$ but node $i$ and $j$ have different signs in $u$ or
$Q_{ij} = -1$ but node $i$ and $j$ have the same sign in $u$.
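
To make the measure concrete, the sketch below builds a binary constraint matrix from lists of ML and CL pairs and evaluates $u^T Q u$ for two assignments. The helper names and the four-node example are hypothetical. Because $Q$ is symmetric, every unordered pair appears twice in the double sum, so each satisfied constraint contributes +2 and each violated one contributes −2.

```python
import numpy as np

def constraint_matrix(n, must_link, cannot_link):
    """Symmetric Q with +1 for ML pairs, -1 for CL pairs, 0 elsewhere."""
    Q = np.zeros((n, n))
    for i, j in must_link:
        Q[i, j] = Q[j, i] = 1.0
    for i, j in cannot_link:
        Q[i, j] = Q[j, i] = -1.0
    return Q

def satisfaction(u, Q):
    """The measure u^T Q u described in the text."""
    return float(u @ Q @ u)

# Hypothetical side information on 4 nodes: ML(0,1) and CL(1,2).
Q = constraint_matrix(4, must_link=[(0, 1)], cannot_link=[(1, 2)])

u_good = np.array([+1, +1, -1, -1])   # satisfies both constraints
u_bad = np.array([+1, -1, -1, +1])    # violates both constraints

score_good = satisfaction(u_good, Q)  # 2 satisfied pairs, each counted twice
score_bad = satisfaction(u_bad, Q)    # 2 violated pairs, each counted twice
```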

We extend the above encoding scheme to accommodate soft constraints by 
relaxing the cluster indicator vector u as well as the constraint matrix Q such 
that: 

Qij is positive if we believe nodes i and j belong to the same cluster; Qij is 
negative if we believe nodes i and j belong to different clusters; the magnitude of 
Qij indicates how strong the belief is. 

Consequently, $u^T Q u$ becomes a real-valued measure of how well the constraints
in $Q$ are satisfied in the relaxed sense. For example, $Q_{ij} < 0$ means we believe nodes
$i$ and $j$ belong to different clusters; in order to improve $u^T Q u$, we should assign
$u_i$ and $u_j$ with values of different signs; similarly, $Q_{ij} > 0$ means nodes $i$ and $j$ are
believed to belong to the same cluster; we should assign $u_i$ and $u_j$ with values of
the same sign. The larger $u^T Q u$ is, the better the cluster assignment $u$ conforms
to the given constraints in $Q$.

Now given this real-valued measure, rather than trying to satisfy all the
constraints in $Q$ individually, we can lower-bound this measure with a constant $\alpha \in \mathbb{R}$:

$$u^T Q u \ge \alpha.$$

Following the notations in Eq.(1), we substitute $u$ with $D^{-1/2} v$, and the above inequality
becomes

$$v^T \bar{Q} v \ge \alpha,$$

where

$$\bar{Q} = D^{-1/2} Q D^{-1/2}$$

is the normalized constraint matrix.

We append this lower-bound constraint to the objective function of
unconstrained spectral clustering in Eq.(1) and we have:

Problem 1 (Constrained Spectral Clustering) Given a normalized graph
Laplacian $\bar{L}$, a normalized constraint matrix $\bar{Q}$ and a threshold $\alpha$, we want to optimize
the following objective function:

$$\arg\min_{v} v^T \bar{L} v, \quad \text{s.t. } v^T \bar{Q} v \ge \alpha,\ v^T v = vol,\ v \ne D^{1/2}\mathbf{1}. \tag{2}$$

Here $v^T \bar{L} v$ is the cost of the cut we want to minimize; the first constraint $v^T \bar{Q} v \ge \alpha$
is to lower bound how well the constraints in $Q$ are satisfied; the second constraint
$v^T v = vol$ normalizes $v$; the third constraint $v \ne D^{1/2}\mathbf{1}$ rules out the trivial
solution $D^{1/2}\mathbf{1}$. Suppose $v^*$ is the optimal solution to Eq.(2), then $u^* = D^{-1/2} v^*$
is the optimal cluster indicator vector.

It is easy to see that the unconstrained spectral clustering in Eq.(1) is covered
as a special case of Eq.(2) where $\bar{Q} = I$ (no constraints are encoded) and $\alpha = vol$
($v^T \bar{Q} v \ge vol$ is trivially satisfied given $\bar{Q} = I$ and $v^T v = vol$).



4.2 Solving the Objective Function 

To solve a constrained optimization problem, we follow the Karush-Kuhn-Tucker
Theorem (Kuhn and Tucker (1982)) to derive the necessary conditions for the
optimal solution to the problem. We can find a set of candidates, or feasible solutions,
that satisfy all the necessary conditions. Then we choose the optimal solution
among the feasible solutions using a brute-force method, given that the size of the
feasible set is finite and small.

For our objective function in Eq.(2), we introduce the Lagrange multipliers as
follows:

$$\Lambda(v, \lambda, \mu) = v^T \bar{L} v - \lambda (v^T \bar{Q} v - \alpha) - \mu (v^T v - vol). \tag{3}$$

Then according to the KKT Theorem, any feasible solution to Eq.(2) must satisfy
the following conditions:

$$\text{(Stationarity)} \quad \bar{L} v - \lambda \bar{Q} v - \mu v = 0, \tag{4}$$
$$\text{(Primal feasibility)} \quad v^T \bar{Q} v \ge \alpha,\ v^T v = vol, \tag{5}$$
$$\text{(Dual feasibility)} \quad \lambda \ge 0, \tag{6}$$
$$\text{(Complementary slackness)} \quad \lambda (v^T \bar{Q} v - \alpha) = 0. \tag{7}$$

Note that Eq.(4) comes from taking the derivative of Eq.(3) with respect to $v$.
Also note that we dismiss the constraint $v \ne D^{1/2}\mathbf{1}$ at this moment, because it
can be checked independently after we find the feasible solutions.

To solve Eq.(4)-(7), we start with looking at the complementary slackness
requirement in Eq.(7), which implies either $\lambda = 0$ or $v^T \bar{Q} v - \alpha = 0$. If $\lambda = 0$, we
will have a trivial problem because the second term from Eq.(4) will be eliminated
and the problem will be reduced to unconstrained spectral clustering. Therefore
in the following we focus on the case where $\lambda \ne 0$. In this case, for Eq.(7) to hold
$v^T \bar{Q} v - \alpha$ must be 0. Consequently the KKT conditions become:

$$\bar{L} v - \lambda \bar{Q} v - \mu v = 0, \tag{8}$$
$$v^T v = vol, \tag{9}$$
$$v^T \bar{Q} v = \alpha, \tag{10}$$
$$\lambda > 0. \tag{11}$$

Under the assumption that $\alpha$ is arbitrarily assigned by the user and $\lambda$ and $\mu$ are
independent variables, Eq.(8)-(11) cannot be solved explicitly, and it may produce an
infinite number of feasible solutions, if one exists. As a workaround, we temporarily
drop the assumption that $\alpha$ is an arbitrary value assigned by the user. Instead, we
assume $\alpha \triangleq v^T \bar{Q} v$, i.e. $\alpha$ is defined such that Eq.(10) holds. Then we introduce
an auxiliary variable, $\beta$, which is defined as the ratio between $\mu$ and $\lambda$:

$$\beta \triangleq -\frac{\mu}{\lambda}\, vol. \tag{12}$$

Now we substitute Eq.(12) into Eq.(8) and obtain:

$$\bar{L} v - \lambda \bar{Q} v + \frac{\lambda \beta}{vol} v = 0,$$

or equivalently:

$$\bar{L} v = \lambda \Big( \bar{Q} - \frac{\beta}{vol} I \Big) v. \tag{13}$$

We immediately notice that Eq.(13) is a generalized eigenvalue problem for a given $\beta$.

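Next we show that the substitution of $\alpha$ with $\beta$ does not compromise our
original intention of lower bounding $v^T \bar{Q} v$ in Eq.(2).

Lemma 1 $\beta < v^T \bar{Q} v$.

Proof Let $\gamma \triangleq v^T \bar{L} v$, by left-hand multiplying $v^T$ to both sides of Eq.(13) we have

$$v^T \bar{L} v = \lambda v^T \Big( \bar{Q} - \frac{\beta}{vol} I \Big) v.$$

Then incorporating Eq.(9) and $\alpha \triangleq v^T \bar{Q} v$ we have

$$\gamma = \lambda (\alpha - \beta).$$

Now recall that $L$ is positive semi-definite (Property 1), and so is $\bar{L}$, which means

$$\gamma = v^T \bar{L} v > 0, \quad \forall v \ne D^{1/2}\mathbf{1}.$$

Consequently, we have

$$\alpha - \beta = \frac{\gamma}{\lambda} > 0 \ \Rightarrow\ v^T \bar{Q} v = \alpha > \beta.$$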
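
The middle step of the proof can be spelled out term by term. The derivation below is only an expansion of the argument, using $v^T v = vol$ from Eq.(9) and the definition of $\alpha$ as $v^T \bar{Q} v$; it introduces no new assumptions.

```latex
\begin{align*}
\gamma = v^T \bar{L} v
  &= \lambda\, v^T \Big( \bar{Q} - \frac{\beta}{vol} I \Big) v
     && \text{(left-multiply Eq.(13) by } v^T\text{)} \\
  &= \lambda \Big( v^T \bar{Q} v - \frac{\beta}{vol}\, v^T v \Big) \\
  &= \lambda \Big( \alpha - \frac{\beta}{vol} \cdot vol \Big)
     && \text{(definition of } \alpha \text{ and Eq.(9))} \\
  &= \lambda (\alpha - \beta).
\end{align*}
```

Since $\gamma > 0$ for any non-trivial $v$ and $\lambda > 0$ by Eq.(11), the sign of $\alpha - \beta$ must be positive.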

In summary, our algorithm works as follows (the exact implementation is shown
in Algorithm 1):

1. Generating candidates: The user specifies a value for $\beta$, and we solve the
generalized eigenvalue system given in Eq.(13). Note that both $\bar{L}$ and $\bar{Q} - \beta/vol \cdot I$
are Hermitian matrices, thus the generalized eigenvalues are guaranteed to be
real numbers.

2. Finding the feasible set: Remove generalized eigenvectors associated with
non-positive eigenvalues, and normalize the rest such that $v^T v = vol$. Note
that the trivial solution $D^{1/2}\mathbf{1}$ is automatically removed in this step because
if it is a generalized eigenvector in Eq.(13), the associated eigenvalue would be
0. Since we have at most $N$ generalized eigenvectors, the number of feasible
eigenvectors is at most $N$.

3. Choosing the optimal solution: We choose from the feasible solutions the
one that minimizes $v^T \bar{L} v$, say $v^*$.

According to Lemma 1, $v^*$ is the optimal solution to the objective function in
Eq.(2) for any given $\beta$ and $\beta < \alpha = v^{*T} \bar{Q} v^*$.

Algorithm 1: Constrained Spectral Clustering

Input: Affinity matrix $A$, constraint matrix $Q$, $\beta$;
Output: The optimal (relaxed) cluster indicator $u^*$;

$vol \leftarrow \sum_{i=1}^{N} \sum_{j=1}^{N} A_{ij}$, $D \leftarrow \mathrm{diag}(\sum_{j=1}^{N} A_{ij})$;
$\bar{L} \leftarrow I - D^{-1/2} A D^{-1/2}$, $\bar{Q} \leftarrow D^{-1/2} Q D^{-1/2}$;
$\lambda_{max}(\bar{Q}) \leftarrow$ the largest eigenvalue of $\bar{Q}$;
if $\beta \ge \lambda_{max}(\bar{Q}) \cdot vol$ then
    return $u^* = \emptyset$;
else
    Solve the generalized eigenvalue system in Eq.(13);
    Remove eigenvectors associated with non-positive eigenvalues and normalize the
    rest by $v \leftarrow \frac{v}{\|v\|} \sqrt{vol}$;
    $v^* \leftarrow \arg\min_{v} v^T \bar{L} v$, where $v$ is among the feasible eigenvectors generated in the
    previous step;
    return $u^* \leftarrow D^{-1/2} v^*$;
end
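
Algorithm 1 can be sketched in Python roughly as follows. This is an illustrative sketch, not the authors' implementation: it assumes SciPy's general (non-symmetric) solver `scipy.linalg.eig` for the pencil in Eq.(13), because the right-hand matrix is usually indefinite, and it uses ad hoc tolerances to discard non-positive, non-finite, or numerically complex eigenvalues. The function name and the small inputs are hypothetical.

```python
import numpy as np
from scipy.linalg import eig

def constrained_spectral_clustering(A, Q, beta, tol=1e-9):
    """Sketch of Algorithm 1 (2-way). Returns u* or None if nothing is feasible."""
    d = A.sum(axis=1)
    vol = d.sum()
    n = len(d)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_bar = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    Q_bar = d_inv_sqrt[:, None] * Q * d_inv_sqrt[None, :]

    # Existence check: beta must stay below lambda_max(Q_bar) * vol.
    if beta >= np.linalg.eigvalsh(Q_bar)[-1] * vol:
        return None

    # Generalized eigenproblem  L_bar v = lambda (Q_bar - beta/vol I) v.
    B = Q_bar - (beta / vol) * np.eye(n)
    w, V = eig(L_bar, B)

    best_v, best_cost = None, np.inf
    for lam, v in zip(w, V.T):
        # Keep only finite, (numerically) real, strictly positive eigenvalues.
        if not np.isfinite(lam) or abs(lam.imag) > 1e-6 or lam.real <= tol:
            continue
        v = np.real(v)
        v = v / np.linalg.norm(v) * np.sqrt(vol)   # enforce v^T v = vol
        cost = v @ L_bar @ v                       # cost of the relaxed cut
        if cost < best_cost:
            best_v, best_cost = v, cost
    if best_v is None:
        return None
    return d_inv_sqrt * best_v                     # u* = D^{-1/2} v*

# Hypothetical inputs: two triangles joined by an edge, constraints preferring
# nodes {0,1,2,3} together and {4,5} together.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
s = np.array([1, 1, 1, 1, -1, -1], dtype=float)
Q = np.outer(s, s)
vol = A.sum()

u_star = constrained_spectral_clustering(A, Q, beta=vol)
```

Returning `None` stands in for the empty result of the algorithm; a production version would probably also want to report the achieved $\alpha = u^{*T} Q u^*$ to the caller.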



4.3 A Sufficient Condition for the Existence of Solutions 

On one hand, our method described above is guaranteed to generate a finite
number of feasible solutions. On the other hand, we need to set $\beta$ appropriately so that
the generalized eigenvalue system in Eq.(13) combined with the KKT conditions in
Eq.(8)-(11) will give rise to at least one feasible solution. In this section, we discuss
such a sufficient condition:

$$\beta < \lambda_{max}(\bar{Q}) \cdot vol,$$

where $\lambda_{max}(\bar{Q})$ is the largest eigenvalue of $\bar{Q}$. In this case, the matrix on the right
hand side of Eq.(13), namely $\bar{Q} - \beta/vol \cdot I$, will have at least one positive eigenvalue.
Consequently, the generalized eigenvalue system in Eq.(13) will have at least one
positive eigenvalue. Moreover, the number of feasible eigenvectors will increase if
we make $\beta$ smaller. For example, if we set $\beta < \lambda_{min}(\bar{Q}) \cdot vol$, with $\lambda_{min}(\bar{Q})$ being the
smallest eigenvalue of $\bar{Q}$, then $\bar{Q} - \beta/vol \cdot I$ becomes positive definite. Then the
generalized eigenvalue system in Eq.(13) will generate $N - 1$ feasible eigenvectors
(the trivial solution with eigenvalue 0 is dropped).

In practice, we normally choose the value of $\beta$ within the range

$$(\lambda_{min}(\bar{Q}) \cdot vol,\ \lambda_{max}(\bar{Q}) \cdot vol).$$

In that range, the greater $\beta$ is, the more the solution will be biased towards
satisfying $Q$. Again, note that whenever we have $\beta < \lambda_{max}(\bar{Q}) \cdot vol$, the value of $\alpha$ will
always be bounded by

$$\beta < \alpha \le \lambda_{max}(\bar{Q}) \cdot vol.$$

Therefore we do not need to take care of $\alpha$ explicitly.
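
The role of $\beta$ can be checked numerically on a small case. The sketch below is an illustration on a hypothetical six-node graph and constraint matrix: it computes the interval endpoints $\lambda_{min}(\bar{Q}) \cdot vol$ and $\lambda_{max}(\bar{Q}) \cdot vol$, inspects the definiteness of $\bar{Q} - \beta/vol \cdot I$ for three choices of $\beta$, and, for the positive definite case, counts the positive generalized eigenvalues with `scipy.linalg.eigh`.

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical inputs: two triangles joined by an edge, rank-one constraint matrix.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
s = np.array([1, 1, 1, 1, -1, -1], dtype=float)
Q = np.outer(s, s)

d = A.sum(axis=1)
vol = d.sum()
n = len(d)
d_inv_sqrt = 1.0 / np.sqrt(d)
L_bar = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
Q_bar = d_inv_sqrt[:, None] * Q * d_inv_sqrt[None, :]

q_eigs = np.linalg.eigvalsh(Q_bar)              # ascending
beta_lo, beta_hi = q_eigs[0] * vol, q_eigs[-1] * vol

def rhs_matrix(beta):
    """Right-hand matrix of Eq.(13): Q_bar - beta/vol * I."""
    return Q_bar - (beta / vol) * np.eye(n)

def num_positive_eigs(M, tol=1e-9):
    return int(np.sum(np.linalg.eigvalsh(M) > tol))

pos_inside = num_positive_eigs(rhs_matrix(vol))        # beta inside the range
pos_above = num_positive_eigs(rhs_matrix(3 * vol))     # beta above the range

# Below the range the right-hand matrix is positive definite, so the symmetric
# definite solver applies; the zero eigenvalue belongs to the trivial solution.
B_small = rhs_matrix(beta_lo - vol)
gen_eigs = eigh(L_bar, B_small, eigvals_only=True)
num_feasible_small = int(np.sum(gen_eigs > 1e-9))
```

Only the positive definite case is solved here because `eigh` requires a positive definite right-hand matrix; inside the practical range a general solver is needed instead.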
Fig. 2 An illustrative example; the affinity structure says {1, 2, 3} and {4, 5, 6} while the node
labeling (coloring) says {1,2,3,4} and {5,6}.

4.4 An Illustrative Example 

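To illustrate how our algorithm works, we present a toy example as follows. In
Fig. 2, we have a graph associated with the following affinity matrix:

$$A = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 1 & 0
\end{bmatrix}$$

Unconstrained spectral clustering will cut the graph at edge (3, 4) and split it into
two symmetric parts {1,2,3} and {4,5,6} (Fig. 3(a)).

Then we introduce constraints as encoded in the following constraint matrix:

$$Q = \begin{bmatrix}
+1 & +1 & +1 & +1 & -1 & -1 \\
+1 & +1 & +1 & +1 & -1 & -1 \\
+1 & +1 & +1 & +1 & -1 & -1 \\
+1 & +1 & +1 & +1 & -1 & -1 \\
-1 & -1 & -1 & -1 & +1 & +1 \\
-1 & -1 & -1 & -1 & +1 & +1
\end{bmatrix}$$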


Q is essentially saying that we want to group nodes {1, 2, 3, 4} into one cluster and 
{5,6} the other. Although this kind of "complete information" constraint matrix 
does not happen in practice, we use it here only to make the result more explicit 
and intuitive. 

$\bar{Q}$ has two distinct eigenvalues: 0 and 2.6667. As analyzed above, $\beta$ must be
smaller than $2.6667\,vol$ to guarantee the existence of a feasible solution, and larger
$\beta$ means we want more constraints in $Q$ to be satisfied (in a relaxed sense). Thus
we set $\beta$ to $vol$ and $2vol$ respectively, and see how it will affect the resultant
constrained cuts. We solve the generalized eigenvalue system in Eq.(13), and plot
the cluster indicator vector $u^*$ in Fig. 3(b) and 3(c) respectively. We can see that
as $\beta$ increases, node 4 is dragged from the group of nodes {5, 6} to the group of
nodes {1,2,3}, which conforms to our expectation that greater $\beta$ value implies
better constraint satisfaction.

[Figure 3 panels: (a) Unconstrained; (b) Constrained, $\beta = vol$; (c) Constrained, $\beta = 2vol$]

Fig. 3 The solutions to the illustrative example in Fig. 2 with different $\beta$. The x-axis is the
indices of the instances and the y-axis is the corresponding entry values in the optimal (relaxed)
cluster indicator $u^*$. Notice that node 4 is biased toward nodes {1,2,3} as $\beta$ increases.
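
The toy example can be reproduced approximately with the sketch below. It assumes the six-node affinity matrix (two triangles joined by edge (3, 4)) and the constraint matrix of this section, solves Eq.(13) with SciPy's general solver, and tracks where the entry of node 4 sits between the entries of node 5 and node 1. The helper names and the position measure `node4_position` are introduced here for illustration only.

```python
import numpy as np
from scipy.linalg import eig

A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
s = np.array([1, 1, 1, 1, -1, -1], dtype=float)
Q = np.outer(s, s)                      # group {1,2,3,4} versus {5,6}

d = A.sum(axis=1)
vol = d.sum()
n = len(d)
d_inv_sqrt = 1.0 / np.sqrt(d)
L_bar = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
Q_bar = d_inv_sqrt[:, None] * Q * d_inv_sqrt[None, :]

def relaxed_indicator(beta):
    """u* for a given beta via Eq.(13); None if no feasible eigenvector exists."""
    B = Q_bar - (beta / vol) * np.eye(n)
    w, V = eig(L_bar, B)
    best = None
    for lam, v in zip(w, V.T):
        if np.isfinite(lam) and abs(lam.imag) < 1e-6 and lam.real > 1e-9:
            v = np.real(v)
            v = v / np.linalg.norm(v) * np.sqrt(vol)
            cost = v @ L_bar @ v
            if best is None or cost < best[0]:
                best = (cost, v)
    return None if best is None else d_inv_sqrt * best[1]

def node4_position(u):
    """Position of node 4 between node 5 (0.0) and node 1 (1.0); sign-invariant."""
    return (u[3] - u[4]) / (u[0] - u[4])

u_vol = relaxed_indicator(vol)          # beta = vol
u_2vol = relaxed_indicator(2 * vol)     # beta = 2 vol

alpha_vol = u_vol @ Q @ u_vol           # achieved constraint satisfaction
alpha_2vol = u_2vol @ Q @ u_2vol
shift = node4_position(u_2vol) - node4_position(u_vol)
```

A positive `shift` indicates that node 4 moves toward the {1,2,3} side as $\beta$ grows, which is the qualitative behavior described for Fig. 3.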



5 Interpretations of Our Formulation 

5.1 A Graph Cut Interpretation 

Unconstrained spectral clustering can be interpreted as finding the Ncut of an
unlabeled graph. Similarly, our formulation of constrained spectral clustering in
Eq.(2) can be interpreted as finding the Ncut of a labeled/colored graph.

Specifically, suppose we have an undirected weighted graph. The nodes of the
graph are colored in such a way that nodes of the same color are advised to
be assigned into the same cluster while nodes of different colors are advised to
be assigned into different clusters (e.g. Fig. 2). Let $v^*$ be the solution to the
constrained optimization problem in Eq.(2). We cut the graph into two parts based
on the values of the entries of $u^* = D^{-1/2} v^*$. Then $v^{*T} \bar{L} v^*$ can be interpreted as
the cost of the cut (in a relaxed sense), which we minimize. On the other hand,

$$\alpha = v^{*T} \bar{Q} v^* = u^{*T} Q u^*$$

can be interpreted as the purity of the cut (also in a relaxed sense), according
to the color of the nodes in respective sides. For example, if $Q \in \{-1, 0, 1\}^{N \times N}$
and $u^* \in \{-1, 1\}^N$, then $\alpha$ equals the number of constraints in $Q$ that are
satisfied by $u^*$ minus the number of constraints violated. More generally, if $Q_{ij}$ is
a positive number, then $u_i^*$ and $u_j^*$ having the same sign will contribute positively
to the purity of the cut, whereas different signs will contribute negatively to the
purity of the cut. It is not difficult to see that the purity can be maximized when
there is no pair of nodes with different colors that are assigned to the same side
of the cut (0 violations), which is the case where all constraints in $Q$ are satisfied.



5.2 A Geometric Interpretation 

(a) The unconstrained Ncut (b) The constrained Ncut

Fig. 4 The joint numerical range of the normalized graph Laplacian $\bar{L}$ and the normalized
constraint matrix $\bar{Q}$, as well as the optimal solutions to unconstrained/constrained spectral
clustering.

We can also interpret our formulation as constraining the joint numerical range
(Horn and Johnson (1990)) of the graph Laplacian and the constraint matrix.
Specifically, we consider the joint numerical range:

$J(\bar{L}, \bar{Q}) \triangleq \{(v^T \bar{L} v, v^T \bar{Q} v) : v^T v = 1\}$. (14)

$J(\bar{L}, \bar{Q})$ essentially maps all possible cuts v to a 2-D plane, where the x-coordinate
corresponds to the cost of the cut, and the y-coordinate corresponds to the constraint
satisfaction of the cut. According to our objective in Eq. (12), we want to minimize
the first term while lower-bounding the second term. Therefore, we are looking for
the leftmost point among those that are above the horizontal line $y = \alpha$.
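The selection rule described above (the leftmost point above the horizontal line) can be sketched as follows. The helper names, the 2×2 matrices, and the candidate points are hypothetical and serve only to illustrate the geometry. A real implementation would obtain the candidates from the generalized eigenvectors rather than from a hand-written list.

```python
import math


def cut_to_point(L_bar, Q_bar, v):
    """Map a cut v to (cost, satisfaction) after scaling v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    w = [x / norm for x in v]
    n = len(w)
    cost = sum(w[i] * L_bar[i][j] * w[j] for i in range(n) for j in range(n))
    sat = sum(w[i] * Q_bar[i][j] * w[j] for i in range(n) for j in range(n))
    return cost, sat


def leftmost_feasible(points, alpha):
    """Return the lowest-cost point with satisfaction >= alpha, or None."""
    feasible = [p for p in points if p[1] >= alpha]
    return min(feasible, key=lambda p: p[0]) if feasible else None


# Hypothetical candidate cuts as (cost, satisfaction) pairs.
candidates = [(0.1, -0.5), (0.3, 0.2), (0.6, 0.9), (0.4, 0.7)]
best = leftmost_feasible(candidates, alpha=0.5)
```

Here the cheapest cut overall, (0.1, -0.5), is rejected because it falls below the threshold, and the cheapest cut among the remaining feasible ones is selected instead.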

In Fig. 4(c), we visualize $J(\bar{L}, \bar{Q})$ by plotting all the unconstrained cuts given
by spectral clustering and all the constrained cuts given by our algorithm in the
joint numerical range, based on the graph Laplacian of a Two-Moon dataset with
a randomly generated constraint matrix. The horizontal line and the arrow indi-
cate the constrained area from which we can select feasible solutions. We can see
that most of the unconstrained cuts proposed by spectral clustering are far below
the threshold, which suggests spectral clustering cannot lead to the ground truth
partition (as shown in Fig. 4(b)) without the help of constraints.

6 Implementation and Extensions

In this section, we discuss some implementation issues of our method. Then we
show how to generalize it to K-way partition where K > 2.



6.1 Constrained Spectral Clustering for 2-Way Partition

The routine of our method is similar to that of unconstrained spectral clustering.
The input of the algorithm is an affinity matrix A, the constraint matrix Q, and
a threshold $\beta$. Then we solve the generalized eigenvalue problem in Eq. (13) and
find all the feasible generalized eigenvectors. The output is the optimal (relaxed)
cluster assignment indicator $u^*$. In practice, a partition is often derived from $u^*$ by
assigning nodes corresponding to the positive entries in $u^*$ to one side of the cut,
and negative entries to the other side. Our algorithm is summarized in Algorithm 1.
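The routine above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function name is hypothetical, Q is assumed symmetric, every node is assumed to have positive degree, and the generalized eigenvalue problem $\bar{L} v = \lambda (\bar{Q} - \frac{\beta}{vol} I) v$ is solved by reducing it to a standard eigenproblem, which additionally assumes that the right-hand matrix is invertible.

```python
import numpy as np


def constrained_spectral_cut(A, Q, beta):
    """Return a relaxed indicator u* for a 2-way cut, or None if infeasible."""
    A = np.asarray(A, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n = A.shape[0]
    deg = A.sum(axis=1)
    vol = deg.sum()
    d_inv_sqrt = 1.0 / np.sqrt(deg)  # assumes every node has positive degree

    # Normalized graph Laplacian and normalized constraint matrix.
    L_bar = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    Q_bar = d_inv_sqrt[:, None] * Q * d_inv_sqrt[None, :]

    # No feasible eigenvector exists if the threshold is too high.
    if beta >= np.linalg.eigvalsh(Q_bar).max() * vol:
        return None

    # Solve L_bar v = lam * B v by reduction to B^{-1} L_bar v = lam v.
    # Assumption: B is invertible (beta / vol is not an eigenvalue of Q_bar).
    B = Q_bar - (beta / vol) * np.eye(n)
    lam, vecs = np.linalg.eig(np.linalg.solve(B, L_bar))

    best, best_cost = None, np.inf
    for k in range(n):
        # Keep real, strictly positive eigenvalues only (tolerance is arbitrary).
        if abs(lam[k].imag) > 1e-8 or lam[k].real <= 1e-8:
            continue
        v = vecs[:, k].real
        v = v / np.linalg.norm(v) * np.sqrt(vol)  # enforce v^T v = vol
        cost = float(v @ L_bar @ v)
        if cost < best_cost:
            best, best_cost = v, cost

    return None if best is None else d_inv_sqrt * best


# Toy example (hypothetical): a 4-clique whose cut is decided by the constraints.
A = np.ones((4, 4)) - np.eye(4)
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = np.outer(y, y)  # ML within {0,1} and {2,3}, CL across the two groups
u = constrained_spectral_cut(A, Q, beta=8.0)
labels = np.sign(u)
```

In the toy example the unconstrained graph has no preferred cut, so the partition is determined entirely by the constraints: nodes 0 and 1 end up on one side and nodes 2 and 3 on the other. For larger or ill-conditioned problems a dedicated generalized eigensolver would likely be a better choice than the explicit reduction used here.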

Since our model encodes soft constraints as degrees of belief, inconsistent con-
straints in Q will not corrupt our algorithm. Instead, they are enforced implicitly
by maximizing $u^T Q u$. Note that if the constraint matrix Q is generated from a
partial labeling, then the constraints in Q will always be consistent.
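Generating Q from a partial labeling can be sketched as below. The function name is hypothetical, unlabeled nodes are marked with `None`, and leaving the diagonal at zero is an assumption made here for simplicity.

```python
def constraints_from_labels(labels):
    """Build a constraint matrix from a partial labeling.

    labels[i] is a class label for labeled nodes and None for unlabeled ones.
    Pairs with equal labels get +1 (Must-Link), pairs with different labels
    get -1 (Cannot-Link), and all other entries stay 0.
    """
    n = len(labels)
    Q = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or labels[i] is None or labels[j] is None:
                continue
            Q[i][j] = 1 if labels[i] == labels[j] else -1
    return Q


# Hypothetical partial labeling: node 1 is unlabeled.
Q = constraints_from_labels([1, None, -1, 1])
```

Because every entry is derived from a single underlying labeling, no two constraints produced this way can contradict each other, which is the consistency property noted above.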

Runtime analysis: The runtime of our algorithm is dominated by that of the
generalized eigendecomposition. In other words, the complexity of our algorithm
is on a par with that of unconstrained spectral clustering in big-O notation, which
is $O(kN^2)$, where N is the number of data instances and k is the number of
eigenpairs we need to compute. Here k is a number large enough to guarantee the
existence of feasible solutions. In practice we normally have $2 < k \ll N$.



6.2 Extension to K-Way Partition 

Our algorithm can be naturally extended to K-way partition for K > 2, following
what we usually do for unconstrained spectral clustering (von Luxburg (2007)):
instead of only using the optimal feasible eigenvector $u^*$, we preserve the top-(K − 1)
eigenvectors associated with positive eigenvalues, and perform the K-means algorithm
based on that embedding.

Specifically, the constraint matrix Q follows the same encoding scheme: $Q_{ij} > 0$
if nodes i and j are believed to belong to the same cluster, $Q_{ij} < 0$ otherwise. To
guarantee we can find K − 1 feasible eigenvectors, we set the threshold $\beta$ such that

$\beta < \lambda_{K-1} vol$,

where $\lambda_{K-1}$ is the (K − 1)-th largest eigenvalue of $\bar{Q}$. Given all the feasible eigen-
vectors, we pick the top K − 1 in terms of minimizing $v^T \bar{L} v$ (see the footnote on
the trivial solution). Let the K − 1 eigenvectors form the columns of
$V \in \mathbb{R}^{N \times (K-1)}$. We perform K-means clustering
on the rows of V and get the final clustering. Algorithm 2 shows the complete
routine.

Footnote: Here we assume the trivial solution, the eigenvector with all 1's, has been excluded.

Algorithm 2: Constrained Spectral Clustering for K-way Partition

Input: Affinity matrix A, constraint matrix Q, $\beta$, K;
Output: The cluster assignment indicator $u^*$;
1 $vol \leftarrow \sum_{i=1}^{N} \sum_{j=1}^{N} A_{ij}$, $D \leftarrow \mathrm{diag}(\sum_{j=1}^{N} A_{ij})$;
2 $\bar{L} \leftarrow I - D^{-1/2} A D^{-1/2}$, $\bar{Q} \leftarrow D^{-1/2} Q D^{-1/2}$;
3 $\lambda_{K-1} \leftarrow$ the (K − 1)-th largest eigenvalue of $\bar{Q}$;
4 if $\beta \geq \lambda_{K-1} vol$ then
5     return $u^* = \emptyset$;
6 end
7 else
8     Solve the generalized eigenvalue system in Eq. (13);
9     Remove eigenvectors associated with non-positive eigenvalues and normalize the
      rest by $v \leftarrow \frac{v}{\|v\|} \sqrt{vol}$;
10    $V^* \leftarrow \arg\min_{V \in \mathbb{R}^{N \times (K-1)}} \mathrm{trace}(V^T \bar{L} V)$, where the columns of V are a subset of
      the feasible eigenvectors generated in the previous step;
11    return $u^* \leftarrow \mathrm{kmeans}(D^{-1/2} V^*, K)$;
12 end

Note that K-means is only one of many possible discretization techniques that
can derive a K-way partition from the relaxed indicator matrix $D^{-1/2} V^*$. Due to
the orthogonality of the eigenvectors, they can be independently discretized first,
e.g. we can replace Step 11 of Algorithm 2 with:

$u^* \leftarrow \mathrm{kmeans}(\mathrm{sign}(D^{-1/2} V^*), K)$. (15)

This additional step can help alleviate the influence of possible outliers on the
K-means step in some cases.

Moreover, notice that the feasible eigenvectors, which are the columns of $V^*$,
are treated equally in Eq. (15). This may not be ideal in practice because these
candidate cuts are not equally favored by graph $\mathcal{G}$, i.e. some of them have higher
costs than the others. Therefore, we can weight the columns of $V^*$ with the inverse
of their respective costs:

$u^* \leftarrow \mathrm{kmeans}(\mathrm{sign}(D^{-1/2} V^* (V^{*T} \bar{L} V^*)^{-1}), K)$. (16)



6.3 Using Constrained Spectral Clustering for Transfer Learning 

The constrained spectral clustering framework naturally fits into the scenario of 
transfer learning between two graphs. Assume we have two graphs, a source graph 
and a target graph, which share the same set of nodes but have different sets of 
edges (or edge weights). The goal is to transfer knowledge from the source graph
so that we can improve the cut on the target graph. The knowledge to transfer is 
derived from the source graph in the form of soft constraints. 

Specifically, let $\mathcal{G}_s(V, E_s)$ be the source graph and $\mathcal{G}_t(V, E_t)$ the target graph. $A_s$
and $A_t$ are their respective affinity matrices. Then $A_s$ can be considered as a con-
straint matrix with only ML constraints. It carries the structural knowledge from
the source graph, and we can transfer it to the target graph using our constrained
spectral clustering formulation:

$\arg\min_{v} v^T \bar{L}_t v$, s.t. $v^T \bar{A}_s v \geq \alpha$, $v^T v = vol$, $v \neq D_t^{1/2} \mathbf{1}$. (17)

$\alpha$ is now the lower bound of how much knowledge from the source graph must be
enforced on the target graph. The solution to this is similar to Eq. (13):

$\bar{L}_t v = \lambda \left( \bar{A}_s - \frac{\beta}{vol(\mathcal{G}_t)} I \right) v$. (18)

Note that since the largest eigenvalue of $\bar{A}_s$ corresponds to a trivial cut, in practice
we should set the threshold such that $\beta < \lambda_2 vol$, where $\lambda_2$ is the second largest
eigenvalue of $\bar{A}_s$. This will guarantee a feasible eigenvector that is non-trivial.
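A short worked step may help explain why only eigenvectors with positive eigenvalues are treated as feasible. It is a sketch under the assumptions that Eq. (18) is the generalized eigenvalue problem written below, that vol denotes the volume of the target graph, and that v is normalized so that $v^T v = vol$. Left-multiplying by $v^T$ gives:

```latex
\bar{L}_t v = \lambda\Big(\bar{A}_s - \frac{\beta}{\mathrm{vol}} I\Big) v
\;\Longrightarrow\;
v^T \bar{L}_t v
  = \lambda\Big(v^T \bar{A}_s v - \frac{\beta}{\mathrm{vol}}\, v^T v\Big)
  = \lambda\big(v^T \bar{A}_s v - \beta\big).
```

Since the normalized Laplacian is positive semi-definite, the left-hand side is non-negative. Hence any eigenvector with $\lambda > 0$ satisfies $v^T \bar{A}_s v \geq \beta$, so $\beta$ acts as a lower bound on how much of the source-graph structure the cut respects.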



7 Testing and Innovative Uses of Our Work 

We begin with three sets of experiments to test our approach on standard spec- 
tral clustering data sets. We then show that since our approach can handle large 
amounts of soft constraints in a flexible fashion, this opens up two innovative uses 
of our work: encoding multiple metrics for translated document clustering and 
transfer learning for fMRI analysis. 

We aim to answer the following questions with the empirical study: 

— Can our algorithm effectively incorporate side information and generate se- 
mantically meaningful partitions? 

— Does our algorithm converge to the underlying ground truth partition as more 
constraints are provided? 

— How does our algorithm perform on real-world datasets, as evaluated against 
ground truth labeling, with comparison to existing techniques? 

— How well does our algorithm handle soft constraints? 

— How well does our algorithm handle large amounts of constraints? 

Recall that in Section 1 we listed four different types of side information:
explicit pairwise constraints, partial labeling, alternative metrics, and transfer of
knowledge. The empirical results presented in this section are arranged accordingly.

All datasets used in our experiments except one (the fMRI scans) are publicly
available online. We implemented our algorithm in MATLAB, which is available
online at http://bayou.cs.ucdavis.edu/ or by contacting the authors.



7.1 Explicit Pairwise Constraints: Image Segmentation 

We demonstrate the effectiveness of our algorithm for image segmentation using 
explicit pairwise constraints assigned by users. 

We choose the image segmentation application for several reasons: 1) it is one
of the applications where spectral clustering significantly outperforms other clus-
tering techniques, e.g. K-means; 2) the results of image segmentation can be easily
interpreted and evaluated by humans; 3) instead of generating random constraints,
we can provide semantically meaningful constraints to see if the constrained par-
tition conforms to our expectation.

(a) Original image (b) No constraints (c) Constraint Set 1 (d) Constraint Set 2

Fig. 5 Segmentation of the elephant image. The images are reconstructed based on the relaxed
cluster indicator $u^*$. Pixels that are closer to the red end of the spectrum belong to one segment
and blue the other. The labeled pixels are bounded by the black and white rectangles.

(a) Original image (b) No constraints (c) Constraint Set 1 (d) Constraint Set 2

Fig. 6 Segmentation of the face image. The images are reconstructed based on the relaxed
cluster indicator $u^*$. Pixels that are closer to the red end of the spectrum belong to one segment
and blue the other. The labeled pixels are bounded by the black and white rectangles.

The images we used were chosen from the Berkeley Segmentation Dataset and
Benchmark (Martin et al (2001)). The original images are 480 × 320 grayscale im-
ages in jpeg format. For efficiency, we reduced them to 10% of the original size
in each dimension, which is 48 × 32 pixels, as shown in Fig. 5(a) and 6(a). Then the
affinity matrix of the image was computed using the RBF kernel, based on both
the positions and the grayscale values of the pixels. As a baseline, we used uncon-
strained spectral clustering (Shi and Malik (2000)) to generate a 2-segmentation
of the image. Then we introduced different sets of constraints to see if they can
generate the expected segmentation. Note that the results of image segmentation vary
with the number of segments. To save us from the complications of parameter tun-
ing, which is irrelevant to the contribution of this work, we always set the number
of segments to be 2.
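An RBF affinity over pixel positions and grayscale values could look like the sketch below. The text does not specify the kernel width or how positions and intensities are combined, so the bandwidth `sigma`, the concatenation of (row, column, gray value) into one feature vector, and the zero diagonal are all assumptions made for illustration.

```python
import math


def rbf_affinity(features, sigma=1.0):
    """Return A with A[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), zero diagonal."""
    n = len(features)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            A[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A


# Hypothetical pixels described as (row, column, grayscale value).
pixels = [(0, 0, 0.1), (0, 1, 0.1), (5, 5, 0.9)]
A = rbf_affinity(pixels, sigma=1.0)
```

Neighboring pixels with similar intensity receive an affinity close to 1, while distant or dissimilar pixels receive an affinity close to 0. In practice the positions and intensities would likely need to be rescaled to comparable ranges before being combined.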

The results are shown in Fig. 5 and 6. To visualize the resultant segmentation,
we reconstructed the image using the entry values in the relaxed cluster indicator
vector $u^*$. In Fig. 5(b), the unconstrained spectral clustering partitioned the ele-
phant image into two parts: the sky (red pixels) and the two elephants and the
ground (blue pixels). This is not satisfying in the sense that it failed to isolate the
elephants from the background (the sky and the ground). To correct this, we in-
troduced constraints by labeling two 5 × 5 blocks to be 1 (as bounded by the black
rectangles in Fig. 5(c)): one at the upper-right corner of the image (the sky) and
the other at the lower-right corner (the ground); we also labeled two 5 × 5 blocks
on the heads of the two elephants to be $-1$ (as bounded by the white rectangles in
Fig. 5(c)). To generate the constraint matrix Q, a ML was added between every
pair of pixels with the same label and a CL was added between every pair of pixels
with different labels. The parameter $\beta$ was set to



$\beta = \lambda_{max} \times vol \times \left( 0.5 + 0.4 \times \frac{\#\text{ of constraints}}{N^2} \right)$, (19)

where $\lambda_{max}$ is the largest eigenvalue of $\bar{Q}$. In this way, $\beta$ is always between
$0.5 \lambda_{max} vol$ and $0.9 \lambda_{max} vol$, and it will gradually increase as the number of con-
straints increases. From Fig. 5(c) we can see that with the help of user supervision,
our method successfully isolated the two elephants (blue) from the background,
which is the sky and the ground (red). Note that our method achieved this with
very simple labeling: four square blocks.
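The threshold rule of Eq. (19) is simple enough to state as a one-line function. The function and parameter names are hypothetical. The number of constraints is assumed here to be the number of nonzero entries of Q, so that the ratio to $N^2$ never exceeds 1.

```python
def beta_threshold(lambda_max, vol, num_constraints, n):
    """Threshold from Eq. (19): lambda_max * vol * (0.5 + 0.4 * #constraints / N^2)."""
    return lambda_max * vol * (0.5 + 0.4 * num_constraints / n ** 2)


# Hypothetical values: lambda_max = 2, vol = 10, N = 4.
beta_none = beta_threshold(2.0, 10.0, 0, 4)   # no constraints
beta_full = beta_threshold(2.0, 10.0, 16, 4)  # every entry of Q is a constraint
```

The two calls give the extremes of the range, namely 0.5 and 0.9 times `lambda_max * vol`, which matches the bounds stated above.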

To show the flexibility of our method, we tried a different set of constraints on
the same elephant image with the same parameter settings. This time we aimed
to separate the two elephants from each other, which is impossible in the uncon-
strained case because the two elephants are not only similar in color (grayscale
value) but also adjacent in space. Again we used two 5 × 5 blocks (as bounded by
the black and white rectangles in Fig. 5(d)), one on the head of the elephant on
the left, labeled to be 1, and the other on the body of the elephant on the right,
labeled to be $-1$. As shown in Fig. 5(d), our method cut the image into two parts
with one elephant on the left (blue) and the other on the right (red), just as
a human user would do.



Similarly, we applied our method on a human face image as shown in Fig. 6(a).
The unconstrained spectral clustering failed to isolate the human face from the
background (Fig. 6(b)). This is because the tall hat breaks the spatial continuity
between the left side of the background and the right side. Then we labeled two
5 × 3 blocks to be in the same class, one on each side of the background. As we
intended, our method assigned the background of both sides into the same cluster
and thus isolated the human face with his tall hat from the background (Fig. 6(c)).
Again, this was achieved simply by labeling two blocks in the image, which covered
about 3% of all pixels. Alternatively, if we labeled a 5 × 5 block in the hat to be
1, and a 5 × 5 block in the face to be $-1$, the resultant clustering would isolate the
hat from the rest of the image (Fig. 6(d)).



7.2 Explicit Pairwise Constraints: The Double Moon Dataset 

We further examine the behavior of our algorithm on a synthetic dataset using 
explicit constraints that are derived from underlying ground truth. 

We claim that our formulation is a natural extension to spectral clustering. The
question to ask then is whether the output of our algorithm converges to that of
spectral clustering. More specifically, consider the ground truth partition defined
by performing spectral clustering on an ideal distribution. We draw an imperfect
sample from the distribution, on which spectral clustering is unable to find the
ground truth partition. Then we perform our algorithm on this imperfect sample.
As more and more constraints are provided, we want to know whether or not the
partition found by our algorithm would converge to the ground truth partition.

(a) A Double Moon sample and its Ncut (b) The convergence of our algorithm

Fig. 7 The convergence of our algorithm on 10 random samples of the Double Moon distri-
bution.



To answer the question, we used the Double Moon distribution. As shown in
Fig. 1, spectral clustering is able to find the two moons when the sample is dense
enough. In Fig. 7(a), we generated an under-sampled instance of the distribution
with 100 data points, on which unconstrained spectral clustering could no longer
find the ground truth partition. Then we performed our algorithm on this im-
perfect sample, and compared the partition found by our algorithm to the ground
truth partition in terms of adjusted Rand index (ARI, Hubert and Arabie (1985)).
ARI indicates how well a given partition conforms to the ground truth: 0 means
the given partition is no better than a random assignment; 1 means the given par-
tition matches the ground truth exactly. For each random sample, we generated
50 random sets of constraints and recorded the average ARI. We repeated the
process on 10 different random samples of the same size and reported the results
in Fig. 7(b). We can see that our algorithm consistently converges to the ground
truth result as more constraints are provided. Notice that there is a performance
drop when an extremely small number of constraints is provided (fewer than 10),
which is expected because so few constraints are insufficient to hint at
a better partition, and consequently lead to random perturbation of the results.
As more constraints were provided, the results quickly stabilized.
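For readers unfamiliar with the metric, the adjusted Rand index can be computed from the contingency table of two labelings. The sketch below uses only the Python standard library and follows the standard pair-counting definition. The function name is an assumption, and at least two data points are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(truth)
    # Pairs of items that fall in the same cell / row / column of the table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (sum_cells - expected) / (max_index - expected)


perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

The index is invariant to renaming the clusters, so the first call returns 1 even though the label values are swapped. It can also drop below 0 when a partition agrees with the ground truth less often than chance would predict.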



To illustrate the robustness of our approach, we created a Double Moon
sample with uniform background noise, as shown in Fig. 8. Although the sample
is dense enough (600 data instances in total), spectral clustering fails to
correctly identify the two moons, due to the influence of background noise (100
data instances). However, with 20 constraints, our algorithm is able to recover the
two moons in spite of the background noise.

(a) Spectral Clustering (b) Constrained Spectral Clustering

Fig. 8 The partition of a noisy Double Moon sample.



Table 2 The UCI benchmarks

Identifier    #Instances    #Attributes
Hepatitis     80            19
Iris          100           4
Wine          130           13
Glass         214           9
Ionosphere    351           34
WDBC          569           30



7.3 Constraints from Partial Labeling: Clustering the UCI Benchmarks 

Next we evaluate the performance of our algorithm by clustering the UCI bench-
mark datasets (Asuncion and Newman (2007)) with constraints derived from ground
truth labeling.

We chose six different datasets with class label information, namely Hepati-
tis, Iris, Wine, Glass, Ionosphere and Breast Cancer Wisconsin (Diagnostic) (WDBC). We
performed 2-way clustering simply by partitioning the optimal cluster indicator
according to the sign: positive entries to one cluster and negative the other. We
removed the setosa class from the Iris data set, which is the class that is known
to be well-separated from the other two. For the same reason we removed Class
3 from the Wine data set, which is well-separated from the other two. We also
removed data instances with missing values. The statistics of the data sets after
preprocessing are listed in Table 2.

For each data set, we computed the affinity matrix using the RBF kernel. To
generate constraints, we randomly selected pairs of nodes that the unconstrained
spectral clustering wrongly partitioned, and filled in the correct relation in Q accord-
ing to ground truth labels. The quality of the clustering results was measured by
adjusted Rand index. Since the constraints are guaranteed to be correct, we set
the threshold $\beta$ such that there will be only one feasible eigenvector, i.e. the one
that best conforms to the constraint matrix Q.
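The constraint-generation procedure described above might be sketched as follows. The function name, the fixed random seed, and the interpretation of a "wrongly partitioned" pair as one whose same-cluster relation differs between the baseline and the ground truth are assumptions made for illustration.

```python
import random


def constraints_from_mistakes(truth, baseline, num_pairs, seed=0):
    """Sample pairs the baseline got wrong and record the correct relation in Q."""
    n = len(truth)
    # A pair is wrong when baseline and ground truth disagree on same/different.
    wrong = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if (truth[i] == truth[j]) != (baseline[i] == baseline[j])
    ]
    rng = random.Random(seed)
    chosen = rng.sample(wrong, min(num_pairs, len(wrong)))
    Q = [[0] * n for _ in range(n)]
    for i, j in chosen:
        Q[i][j] = Q[j][i] = 1 if truth[i] == truth[j] else -1
    return Q


# Hypothetical labels: the baseline misplaces node 1.
truth = [0, 0, 1, 1]
baseline = [0, 1, 1, 1]
Q = constraints_from_mistakes(truth, baseline, num_pairs=10)
```

In this toy case only three pairs are wrongly partitioned, so asking for ten constraints returns all three: a Must-Link between nodes 0 and 1, and Cannot-Links between node 1 and nodes 2 and 3.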

In addition to comparing our algorithm (CSP) to unconstrained spectral clus-
tering, we implemented two state-of-the-art techniques:

Spectral Learning (SL) (Kamvar et al (2003)) modifies the affinity matrix of
the original graph directly: $A_{ij}$ is set to 1 if there is a ML between nodes i and
j, 0 for CL.

Semi-Supervised Kernel K-means (SSKK) (Kulis et al (2005)) adds penalties
to the affinity matrix based on the given constraints, and then performs kernel
K-means on the new kernel to find the partition.

We also tried the algorithm proposed by Yu and Shi (2001, 2004), which en-
codes partial grouping information as a projection matrix, the subspace trick pro-
posed by De Bie et al (2004), and the affinity propagation algorithm proposed by
Lu and Carreira-Perpiñán (2008). However, we were not able to use these algo-
rithms to achieve better results than SL and SSKK, hence their results are not
reported. Xu et al (2005) proposed a variation of SL, where the constraints are
encoded in the same way, but instead of the normalized graph Laplacian, they
suggested using the random walk matrix. We performed their algorithm on the
UCI datasets, which produced results largely identical to those of SL.

We report the adjusted Rand index against the number of constraints used
(ranging from 50 to 500) so that we can see how the quality of clustering varies
when more constraints are provided. At each stop, we randomly generated 100
sets of constraints. The mean, maximum and minimum ARI of the 100 random
trials are reported in Fig. 9. We also report the ratio of the constraints that were
satisfied by the constrained partition in Fig. 10. The observations are:

— Across all six datasets, our algorithm is able to effectively utilize the con-
straints and improve over unconstrained spectral clustering (Baseline). On the
one hand, our algorithm can quickly improve the results with a small number
of constraints. On the other hand, as more constraints are provided, the per-
formance of our algorithm consistently increases and converges to the ground
truth partition (Fig. 9).

— Our algorithm outperforms the competitors by a large margin in terms of
ARI (Fig. 9). Since we have control over the lower-bounding threshold $\alpha$, our
algorithm is able to satisfy almost all the given constraints (Fig. 10).

— The performance of our algorithm has significantly smaller variance over dif-
ferent random constraint sets than its competitors (Fig. 9 and 10), and the
variance quickly diminishes as more constraints are provided. This suggests
that our algorithm would perform more consistently in practice.

— Our algorithm performs especially well on sparse graphs, i.e. Fig. 9(e)(f), where
the competitors suffer. The reason is that when the graph is too sparse, it im-
plies many "free" cuts that are equally good to unconstrained spectral cluster-
ing. Even after introducing a small number of constraints, the modified graph
remains too sparse for SL and SSKK, which are unable to identify the ground
truth partition. In contrast, these free cuts are not equivalent when judged by
the constraint matrix Q of our algorithm, which can easily identify the one cut
that maximizes $v^T \bar{Q} v$. As a result, our algorithm outperforms SL and SSKK
significantly in this scenario.



7.4 Constraints from Alternative Metrics: The Reuters Multilingual Dataset 

We test our algorithm using soft constraints derived from alternative metrics of 
the same set of data instances. 

Fig. 9 The performance of our algorithm (CSP) on six UCI datasets, with comparison to un-
constrained spectral clustering (Baseline) and the Spectral Learning algorithm (SL). Adjusted
Rand index over 100 random trials is reported (mean, min, and max).

We used the Reuters Multilingual dataset, first introduced by Amini et al
(2009). Each time we randomly sampled 1000 documents which were originally
written in one language and then translated into four others, respectively. The
statistics of the dataset are listed in Table 3. These documents came with ground
truth labels that categorize them into six topics (K = 6). We constructed one
graph based on the original language, and another graph based on the transla-
tion. The affinity between two documents was the cosine similarity between their
tf-idf vectors. Then we used one of the two graphs as the constraint matrix Q,
whose entries can then be viewed as soft ML constraints. We enforced this con-
straint matrix on the other graph to see if it can help improve the clustering. We
did not compare our algorithm to existing techniques because they are unable to
incorporate soft constraints.
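The cosine-similarity affinity used for both graphs can be sketched as below. The tf-idf vectors are assumed to be precomputed and stored as sparse dictionaries mapping terms to weights. The function names, the toy documents, and the zero diagonal are illustrative assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * y.get(term, 0.0) for term, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def cosine_affinity(docs):
    """Pairwise cosine similarities with a zero diagonal."""
    n = len(docs)
    return [
        [cosine_similarity(docs[i], docs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical tf-idf vectors for three documents.
docs = [{"a": 1.0, "b": 1.0}, {"a": 1.0}, {"c": 2.0}]
A = cosine_affinity(docs)
```

Because tf-idf weights are non-negative, every entry lies in [0, 1], so a matrix built this way from one language version can serve directly as a constraint matrix containing only soft ML constraints.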

As shown in Fig. 11, unconstrained spectral clustering performs better on the
original version than the translated versions, which is not surprising. If we use the
original version as the constraints and enforce that onto a translated version using
our algorithm, we obtain a constrained clustering that is not only better than the
unconstrained clustering on the translated version, but also even better than the
unconstrained clustering on the original version. This indicates that our algorithm
is not merely a tradeoff between the original graph and the given constraints.
Instead it is able to integrate the knowledge from the constraints into the original
graph and achieve a better partition.

Table 3 The Reuters Multilingual dataset

Language    #Documents    #Words
English     2000          21,531
French      2000          24,893
German      2000          34,279
Italian     2000          15,506
Spanish     2000          11,547

(a) English Documents and Translations (b) French Documents and Translations

Fig. 11 The performance of our algorithm on Reuters Multilingual dataset.



7.5 Transfer of Knowledge: Resting-State fMRI Analysis 



Finally we apply our algorithm to transfer learning on the resting-state fMRI data. 

An fMRI scan of a person consists of a sequence of 3D images over time. We
can construct a graph from a given scan such that a node in the graph corresponds
to a voxel in the image, and the edge weight between two nodes is (the absolute
value of) the correlation between the two time sequences associated with the two
voxels. Previous work has shown that by applying spectral clustering to the resting-
state fMRI we can find the substructures in the brain that are periodically and
simultaneously activated over time in the resting state, which may indicate a
network associated with certain functions ( van den Heuvel et a
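The graph construction described above, with edge weights given by the absolute correlation between voxel time series, can be sketched as follows. The use of the Pearson correlation, the function names, the zero diagonal, and the toy time series are assumptions. Real fMRI data would involve far more voxels and additional preprocessing.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # constant series carry no correlation information
    return cov / (std_x * std_y)


def correlation_graph(series):
    """Affinity matrix with |correlation| as edge weight and a zero diagonal."""
    n = len(series)
    return [
        [abs(pearson(series[i], series[j])) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical time series for four voxels.
series = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, -1, 1, -1]]
W = correlation_graph(series)
```

Taking the absolute value means that strongly anti-correlated voxels are connected as tightly as strongly correlated ones, as in the third toy series, which is the exact mirror image of the first.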

One of the challenges of resting-state fMRI analysis is instability. Noise can be
easily introduced into the scan result, e.g. the subject moved his/her head during
the scan, the subject was not at resting state (actively thinking about things during
the scan), etc. Consequently, the result of spectral clustering becomes unstable. If
we apply spectral clustering to two fMRI scans of the same person on two different
days, the normalized min-cuts on the two different scans are so different that they
provide little insight into the brain activity of the subject (Fig. 12(a) and (b)). To
overcome this problem, we use our formulation to transfer knowledge from Scan 1
to Scan 2 and get a constrained cut, as shown in Fig. 12(c). This cut represents
what the two scans agree on. The pattern captured by Fig. 12(c) is actually the
default mode network (DMN), which is the network that is periodically activated at
resting state (Fig. 12(d) shows the idealized DMN as specified by domain experts).

(c) Constrained cut by transferring Scan 1 to 2 (d) An idealized default mode network

Fig. 12 Transfer learning on fMRI scans.

To further illustrate the practicability of our approach, we transfer the idealized
DMN in Fig. 12(d) to a set of fMRI scans of elderly subjects. The dataset was
collected and processed within the research program of the University of California
at Davis Alzheimer's Disease Center (UCD ADC). The subjects were categorized
into two groups: those diagnosed with cognitive syndrome (20 individuals) and
those without cognitive syndrome (11 individuals). For each individual scan, we
encode the idealized DMN into a constraint matrix (using the RBF kernel), and
enforce the constraints onto the original fMRI scan. We then compute the cost
of the constrained cut that is the most similar to the DMN. If the cost of the
constrained cut is high, it means there is great disagreement between the original
graph and the given constraints (the idealized DMN), and vice versa. In other
words, the cost of the constrained cut can be interpreted as the cost of transferring
the DMN to the particular fMRI scan.

In Fig. 13 we plot the costs of transferring the DMN to both subject groups.
We can clearly see that the costs of transferring the DMN to people without
cognitive syndrome tend to be lower than to people with cognitive syndrome. This
conforms well to the observation made in a recent study that the DMN is often
disrupted for people with Alzheimer's disease (Buckner et al (2008)).

8 Conclusion 

In this work we proposed a principled and flexible framework for constrained spec-
tral clustering that can incorporate large amounts of both hard and soft con-
straints. The flexibility of our framework lends itself to the use of all types of
side information: pairwise constraints, partial labeling, alternative metrics, and
transfer learning. Our formulation is a natural extension to unconstrained spec-
tral clustering and can be solved efficiently using generalized eigendecomposition.
We demonstrated the effectiveness of our approach on a variety of datasets: the
synthetic Two-Moon dataset, image segmentation, the UCI benchmarks, the mul-
tilingual Reuters documents, and resting-state fMRI scans. The comparison to
existing techniques validated the advantage of our approach.

Fig. 13 The costs of transferring the idealized default mode network to the fMRI scans of
two groups of elderly individuals.